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Abstract 

We study the folding dynamics of polyalanine (Ala2o), a protein fragment with 20 residues 
whose native state is a single alpha helix. We use the CSAW model (conditioned self-avoiding 
walk), which treats the protein molecule as a chain in Brownian motion, with interactions that 
include hydrophobic forces and internal hydrogen bonding. We find that large scale structures 
form before small scale structures, and obtain the relevant relaxation times. We find that helix 
nucleation occurs at two separate points on the protein chain. The evolution of small and large 
scale structures involve different mechanisms. While the former can be describe by rate equations 
governing the growth of helical content, the latter is akin to the relaxation of an elastic solid. 

PACS numbers: 87.14.Ee, 87.15. Cc, 87.15.Aa, 87.15.He, 05.10.Ln 
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I. INTRODUCTION 

A protein is a chain of amino acids, referred to as residues, that folds into a characteristic 
shape in water at a sufficiently low temperature. The main force comes from the hydrophobic 
effect, which tends to drive hydrophobic residues to the interior of the folded conformation, 
or native state. Hydrogen bonding among residues leads to the formation of alpha helices 
and beta sheets. The native structure is usually described in terms of primary, secondary, 
and tertiary structures, which repsectively refer to the sequence of amino acids along the 
chain, the alpha helices and beta sheets mentioned, and the gross geometrical structurejl], 3]. 

An interesting question is whether secondary structure emerges before tertiary structure 
during folding. We try to answer the question for a protein fragment (peptide), polyalanine 
(Ala2o), which has 20 identical amino acids, the hydrophobic Alanine. The native state is 
known to be a single alpha helix. The tertiary structure, therefore, is a cylindrical tube. 

We use the CSAW (conditioned self-avoiding walk) model propose recently by one of 
usjsj]. The idea is that the unfolded protein is a random coil, which can be represented as a 
random walk that is not allowed to cross itself. Such a self-avoiding walk (SAW) simulates 
the fact that different residues cannot occupy the same location in space. The interactions 
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repsonsible for folding, chiefly the hydrophobic interaction and hydrogen-bonding, are taken 
into by imposing conditions on the SAW, hence the name CSAW. 

In a computer simulation, one generates an ensemble of SAW and extracts a sub-ensemble 
that satisfies desired conditions. The latter process is implemented through the Monte 
Carlo method, which can generate a sequence of states distributed according to a canonical 
ensemble. The conditions mentioned are expressed through various energy terms in the 
Hamiltonian. 

Mathematically, CSAW is a simulation of a Langevin equation [4] that describes the fold- 
ing protein chain. We refer to refs.jsj for details, but include a brief summary in the Ap- 
pendix, and illustrate its equivalence to molecular dynamics through a simple example. 

We find that, in the case considered, the overall size of the protein has reached an equi- 
librium value, while helical content continues to increase. In this sense, large scale structure 
forms before small scale structure. We will be able to see how the helix starts forming 
through nucleation. 

The time evolution of the large and small scale structures exhibit qualitatively different 
behaviors, which we can explain in terms of phenomenological models. The formation of 
small-scale secondary structures are governed by rate equations for the growth of helical 
content, while the relaxation of the large-scale size is analogous to that of an elastic solid. 

Since CSAW is a relatively new model, this work serves as a test of its validity. In this 
respect, the model appears to be effective in describing the dynamics of folding. When we 
affirm the model, we are affirming the underlying principle as implied by the Langevin equa- 
tion, namely, protein folding is a stochastic dissipative process that tends toward thermal 
equilibrium with the environment. 

From this point of view, the perennial debate on whether the folded state is in ther- 
dynamic equilibrium or in some kinetic steady state is merely a question of whether the 
protein can reach thermal equilibrium in realistic time, or gets trapped in some intermedi- 
ate state. The calculations here, using experimental input, indicate that a small protein like 
the present one reaches thermal equilibrium in the order of 20ns. It would be interesting to 
investgate this question for large proteins. 

This work is intended to address one aspect of the dynamical process of protein folding, 
instead of validating the model in a comprehensive way. Therefore, instead of putting 
everything into a comprehensive model, we focus on two most important interactions for 
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helix formation: hydrophobic and hydrogen bonding interactions. The Monte Carlo method 
used is based on a simulation of the Langevin equation, which is consistent to the physical 
model. Despite the simplicity, this model allows us to delineate the mechanisms of the 
formation of different scale structures. In the simple example, we find that the evolution of 
small and large scale structures involve different mechanisms. This, not necessary general, 
can give physical insight for wider understanding of protein folding as a stochastic process. 
Such physical insight will be helpful for further study, both theoretically and experimentally. 

II. METHODS AND RESULTS 

The initial state of the simulation was created by unfolding the native state of Ala2o 
through 4 x 10 4 CSAW steps, at a program temperature of T* = 4.4. The value was high 
enough to make the protein unfold into a random coil. We then set T* = 0.2, and the 
folding process began. We do not yet have a precise calibration of T* against the physical 
temperature. 

The folding process ran for 4 x 10 6 CSAW steps. The entire run was repeated 100 times 
to generate an ensemble of 100 folding trajectories. Each trajectory consumes the order of 
1 hr of computer time on a workstation. 

A CSAW step here means one Monte Carlo trial step, whether or not the trial results 
in a successful update. It simulates real time, during which the system tries to overcome 
energy barriers but does not always succeed. On average, it takes about 30 tries to achieve 
an update. 

According to analysis given later, which compares our results with experimental data, 
our run corresponds to about 10ns in physical time^, |f|. From this we estimate that one 
CSAW step corresponds to approximately 10~ 15 s. 

During folding, the helical content rises from an average initial value of 0.05 to 0.55, and 
tends toward an asymptote of 0.77. This indicates that we have not reached the native state. 
The ensemble generated is an evolving ensemble that is not yet canonical. This suits our 
purpose, which is to study the folding dynamics. Various relaxation times can be obtained 
by analyzing the evolution towards equilibrium. 

We follow the evolution at different length scales by measuring the average radius of 
gyration R g , average length of helix segments Lh, and average helicity They are defined 
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by 

1 N 

n,m=l 

Lh{1) — (Av. length of helix segment) . 
where N is the number of residues, R n (t) is the instantaneous position of the center of 



/No. residues in helix \ 
Mt) = \ Maximum No. ' ' ( ) 



the nth residue, and () denotes ensemble average at time t. These quantities respectively 
measure structures on the largest, intermediate, and smallest length scales. 
We also measure the structure factor 

1 N 

^ ^ = nY1 < ex P (* k • [ R «W " R ™(*)D> > ( 2 ) 

n,m=l 

which is the Fourier transform of the density correlation function, accessible to experiments 
through x-ray scattering. It is independent of the directions of k because of the ensemble 
average Q]. 

FigJT] shows distributions of R g , L H , f H at different times. Fig [2] shows R g , L H , f H as 
functions of time on a log scale. Lines through the calculated points are fit made to obtain 
relaxation times, to be detailed later. 

The behavior of R a shows that there was a very fast collapse, followed by two slower 

nn 

stages. Such a two-stage behavior has been observed experimentally in larger proteins [8, 9|. 
We shall analyze them later in terms of theoretical models. 

As we can see from Figl21 there is little change in R g apart from fluctuations after Ins, 
but Lh and /# continue to increase. This means that, while the overall size of the protein 
has equilibriated, the secondary structures continue to adjust. We shall quantify this in 
terms of relaxation times. 

FigJH] shows g(k) as functions of k, for different times (see caption for detail). 

FigfJ] shows a contour map of the ensemble average of local helicity. The vertical axis is 
residue number, and the horizontal axis is time. Helix nucleation started near residues 6 and 
14. Since the contour plot here is an ensemble average, this indicated that the nucleation 
points are not random, but occur at specific positions, at least for this small protein. 
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III. ANALYSIS AND DISCUSSION 



The existence of two folding stages suggest that we fit the late-time evolutions with two 
exponential functions. Indeed, we obtain good fits for LhJh with 

L H (t) = 7.43 - 0.98e-'/°- 17 - 2.95e-'/ 4 - 68 , 

W 3 

f H (t) = 0.77 - 0.13e-'/°- 17 - 0.55e"*/ 4 - 68 . 
where the unit for t is ns. These are shown as solid curves in Figj2] They suggest that the 
ensemble will reach equilibrium at 20ns, with average helical content 0.77. The relaxation 
times for the two stages are 0.17ns and 4.68ns, respectively. 

The time scale is determined as follows. Originally t was measured in CSAW steps. We 
judge that at the end of our runs the folding process was about 70% complete, and that 
puts the halfway j3oint at about 3xl0 6 steps. Identifying this with the experimental value 
of ti/2 = 16ns 



, we arrive at the estimate of approximately 10 _15 s per step. 
The two-stage behavior of the development of secondary structure suggests the following 
model. We picture the ensemble as a mixture of three classes of protein chains: unfolded 
(U) with f H < 02, intermediate (I) with 0.2 < f H < 0.5, and folded (F) with f H > 0.5. 
There are three-state transitions among these classes: 

U ^ I ^ F . (4) 

The relative fractions of these classes evolve with time. These fractions can be obtained 
by solving rate equations, using time constants given prevously. They can of course be 
extracted from our simulation data. FigJS] shows that the two agree rather well. 

The two-exponential fit does not work for R g , as we see by the dashed curve in Fig 151 
Thus, the relaxation of of R g calls for a different mechanism. For this, we model the gross 
structure as an elastic solid, with an effective potential energy 

■*>-(*)■" -(*)"• 

and a phenomenological equation of motion 

dR g dV 

where t is in ns, R g in A. Solving the equation with 7 = 3.33, A = 9.58, B = 14.35, initial 
condition R g (0) = 8.7, we obtain the solid curve in FigJ2l which gives a good fit to the 
simulation data. 
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FigfS] shows that R g reaches equilibrium after about 1 ns, at a value slightly lower than 
the data points. This suggests that there are perturbations to radius relaxation from the 
secondary structure, which continues to undergo adjustment. 



IV. RELATION TO OTHER WORKS 



In our ensemble, we find that there are two types of evolution paths, a fast and a slow one, 
as illustrated in FigJBl This supports results from discontinuous molecular dyanmics[10]. We 
have found the same fast and slow paths in the folding of chignolin, a 10-residue synthetic 



peptide 



1XQ . The meaning of this is not yet clear. 



In an early work on alpha-helix formation, Zimm and Bragg introduced parameters s and 
a, which respectively measures the probability of helix growth and nucleation. The quantities 
of these two parameters are related to ju and Lh according to following equations [131]. 

f = - - = (7) 

JH ' 2 2^(1 -a) 2 + 4*r 1 1 

L H = 1 + 28 (8) 

l-s + y/(l -s) 2 + Asa 

We have calculated these quantities using CSAW. An advantage in our model is that we 
can turn on or off selected interactions. In FigJ7]we show the results with and without the 
hydrophobic effect. We can see that the hydrophobic effect has pronounced influence on 
helix nucleation, but it is not as important for helix growth. 



APPENDIX A: CSAW (CONDITIONED SELF- AVOIDING WALK) 

The CSAW model is an algorithm that successively updates a protein state, in order to 
generate a canonical ensemble of states. Starting with an initial state which is an arbitrary 
non-overlapping chain (SAW), we generate a new chain by the pivoting algorithm, and 
keep doing so until we obtain another non-overlapping chain (a new SAW). We then decide 
whether to accept this as an update via the Metropolis Monte Carlo method, as follows. 
We ask whether the proposed update decreases the energy E. If it does we accept it, and 
otherwise accept it with a relative probability given by the Boltzmann factor 

p = exp (- (£ new - Eoi d ) I k B T) . 
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That the energy can increase simulates thermal fluctuations, and makes the updating process 
one of minimizing the free energy. 

Along the backbone of the protein chain a series of carbon atoms (the C a ) are connected 
by covalent chemical bonds that are shaped like a crank, which lies in one plane. The major 
degrees of freedom of the chain are the torsional angles that define the relative orientation 
of two successive planes. Other degrees of freedom, such as small vibrations of the chemical 
bonds, can be neglected when we consider protein folding. The state of a protein of N 
residues are thus specified by N — 1 pairs of torsional angles. These are the only degrees of 
freedom considered. 

For the present study, side chains are approximated by hard spheres, and other atoms are 
treated as hard spheres with known van der Waals radii. The only interactions included are 
those corresponding to the hydrophobic effect, and hydrogen bonding. The energy is taken 
to be 

E = -g 1 K 1 -g 2 K 2 , (Al) 

where: K\ is the total hydorphobic contact number, i.e., the total number of nearest neigh- 
bors surrounding a hydrophobic residue, (not counting the two permanent nearest neighbors 
along the chain.) In the present case, all residues are hydrophobic. The quantity K 2 is the 
total number of internal hydrogen bonds, which connects hydrogen to oxygen in different 
residues. Such a bond is deemed to exist whenever the partner are within a certain range 
of distrance from each other, and the chemical bonds they are attached to are antiparallel 
within given margins. 

In general, the clear separation of hydrophobic effect and hydrogen bonding is an approx- 
imation, under the assumption that atoms on the backbone can only bond with one another, 
while those on the side chains can only bond with water. In the present case this distinction 
is moot, since all rersidues are hydrophobic, whose side chains cannot form hydrogen bonds. 

There are two independent parameters g\ and g 2 in the model. Actually, in the Monte 
Carlo procedure, only the combinations gi/ksT and g 2 /ksT are relevant. We define a 
program temperature T* = ksT/g 2 , and use T* and gi/g 2 as independent parameters. To 
simplify the notation, we set g 2 = 1. To fix the parameters, we calculate the helical content 
fa for various values, and choose that which gives the maximum helicity after 10 6 CSAW 
steps. The contour map of /# so obtained is shown in FigJHJ from which we pick the values 
g x = 0.05, T* = 0.2. 
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The CSAW model is a computer simulation of a generalized Langevin equation for the 
protein chain, which has the form 



m k r k = -m k j h r k + F k (t) + G k (r 1} ■ ■ ■ ,r N ), (k = 1, • • • , N) (A2) 

where rj is the position of the ith atom, m k its mass, 7^ its dissipation coefficient, and F k 
the random force acting on it by the medium. The term G k is a symbolic representation of 
all the non-random forces acting on the atom, including interaction with other atoms in the 
protein, the chemical bonds that hold the chain together, and the hydrophobic interaction 
with the medium. It is highly improbable that we can solve this equation analytically, but 
we can simulate on a computer. The dissipation and random force is simulated by random 
walk, and the forces in that maintain the chain and prevent atoms from overlapping 
make the random walk a SAW. The rest of the forces in G^ is taken into account through 
Monte Carlo. 

To illustrate that this procedure yields a solution of the equation, we consider a simpler 
case, the Brownian motion of a particle in ID, in a potential well. The Lagevin equation 
reads 

„ = _dU^x)_ _ ± + p ^ (A3) 
ax 

with a double-well potential 

U(x) = a Q* 2 - i* 3 ) + Q* 4 - ^ , (A4) 

which is sketched in FigJH], with a = 3.5. The equation is then solved as a stochastic differ- 
entiation using molecular dynamics, and alternatively using Monte Carlo. Comparison of 
these two methods are given in FigJTU| which show the equivalence in a statistical sense. 
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FIGURE CAPTIONS 



Fig.l Distribution function for (A) radius of gyration R g , (B) average length of helix 
segment Lh, (C) local helicity /# . They pertain to structures on large, intermediate and 
small length scales, respectively. The distributions are shown for three different times during 
folding, which lasts 4 ns. 

Fig. 2. Time evolution of (A) radius of gyration R g , (B) average length of helix segment 
Lh, (C) local helicity fn- For (B) and (C), the solid curves are fits by a sum of two 
exponentials, which can be derived from rate equations governing the growth of helicity. 
The two-exponential for (A), shown as the dashed curve, is not satisfactory. Instead, a 
better fit (solid curve) is obtained via a model that treats the protein as an elastic solid. 
This shows that the relaxation of large and small scale structures are governed by different 
mechanisms. 

Fig. 3. The structure function g(k) is the Fourier tranform of the density correlation 
function, and contains information about structures on different length scales. It is shown 
at different times during the folding process. The inset show the time evolution at two 
specific wave numbers k, corresponding respectively to large (dashed curve) and small (solid 
curve) scale structures. 

Fig. 4. Contour map of ensemble average of local helicity. Vertical axis is residue sequence 
along the protein chain, and horizontal axis is time. Arrows point to points of nucleation of 
helical structure. 

Fig. 5. The ensemble can be divided into classes of protein chains characterized by dif- 
ferent helical content: U-unfolded, I-intermediate, F-folded. Data points are from CSAW 
simulation, and solid curves are calculated from phenomenological rate equations governing 
transitions among these classes. The rate equations underlie the two-exponental fits in (B) 
and (C) of FigfJJ 

Fig. 6. The evolution of helicity reveals two types of folding paths, fast and slow. This 
reproduces results of other works, but its significance is yet to be understood. 

Fig. 7. Evolution of the probability of helix growth (s), and helix nucleation (a). The 
hydorphobic effect is turned on in the upper curves, and off in the lower cureves. It has a 
more prounced effect on nucleation than on growth. 
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Fig.8. Contour map of average helicity as functions of CSAW parameters. We choose the 
set of parameters at the maximum helicity. 

Fig.9. Double-well potential used in the Lagevin equation in illustrative calculations. 

Fig. 10. Demonstration of the statistical equivalence of MC (Monte Carlo) and MD 
(molecular dynamics) in solving the illustrative Langevin equation. (A) and (B) show the 
average position and standard deviation as functions of time. (C) and (D) show sample 
paths from MC and MD respectively. We can see that the position x go over the energy 
barrier to visit the origin, but at different times. 
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